ABSTRACT

Time-frequency representations of audio signals often resem- 
ble texture images. This paper derives a simple audio clas- 
sification algorithm based on treating sound spectrograms as 
texture images. The algorithm is inspired by an earlier visual 
classification scheme particularly efficient at classifying tex- 
tures. While solely based on time-frequency texture features, 
the algorithm achieves surprisingly good performance in mu- 
sical instrument classification experiments. 

Index Terms — Audio classification, visual, time-frequency 
representation, texture. 

1. INTRODUCTION 

With the increasing use of multimedia data, the need for au- 
tomatic audio signal classification has become an important 
issue. Applications such as audio data retrieval and audio file 
management have grown in importance H] fT6l . 

Finding appropriate features is at the heart of pattern 
recognition. For audio classification, considerable effort has
been dedicated to investigating relevant features of diverse types.
Temporal features such as temporal centroid, auto-correlation | 
|2|, zero-crossing rate characterize the waveforms in the time 
domain. Spectral features such as spectral centroid, width, 
skewness, kurtosis, flatness are statistical moments obtained 
from the spectrum 12 1. MFCCs (mel-frequency cepstral 
coefficients) derived from the cepstrum represent the shape 
of the spectrum with a few coefficients 1 13 1. Energy descrip- 
tors such as total energy, sub-band energy, harmonic energy 
and noise energy flT] [12J measure various aspects of signal 
power. Harmonic features including fundamental frequency, 
noisiness and inharmonicity HUTj reveal the harmonic properties
of the sounds. Perceptual features such as loudness, sharpness
and spread incorporate the human hearing process [20] [10] to
describe the sounds. Furthermore, feature combination and
selection have been shown to be useful for improving
classification performance [5].

While most features previously studied have an acoustic 
motivation, audio signals, in their time-frequency representa- 
tions, often present interesting patterns in the visual domain. 
Fig. 2 shows the spectrograms (short-time Fourier representations)
of solo phrases of eight musical instruments. Specific patterns
can be found repeatedly in the sound spectrogram of a given
instrument, reflecting in part the physics of
sound generation. By contrast, the spectrograms of differ- 
ent instruments, observed like different textures, can easily 
be distinguished from one another. One may thus expect to 
classify audio signals in the visual domain by treating their 
time-frequency representations as texture images. 

In the literature, little attention seems to have been paid
to audio classification in the visual domain. To our knowledge,
the only work of this kind is that of Deshpande and his
colleagues [3]. To classify music into three categories (rock,
classical, jazz) they consider the spectrograms and MFCCs 
of the sounds as visual patterns. However, the recursive fil- 
tering algorithm that they apply seems not to fully capture 
the texture-like properties of the audio signal time-frequency 
representation, limiting performance. 

In this paper, we investigate an audio classification algo- 
rithm purely in the visual domain, with time-frequency rep- 
resentations of audio signals considered as texture images. 
Inspired by the recent biologically-motivated work on object
recognition by Poggio, Serre and their colleagues [14],
and more specifically by its variant [19], which has been
shown to be particularly efficient for texture classification, we 
propose a simple feature extraction scheme based on time- 
frequency block matching (the effectiveness of application of 
time-frequency blocks in audio processing has been shown in 
previous work I ITl fTSll ). Despite its simplicity, the proposed 
algorithm relying only on visual texture features achieves sur- 
prisingly good performance in musical instrument classifica- 
tion experiments. 

The idea of treating instrument timbres just as one would 
treat visual textures is consistent with basic results in neu- 
roscience, which emphasize the cortex's anatomical unifor- 
mity 191 |2l and its functional plasticity, demonstrated exper- 
imentally for the visual and auditory domains in |15|. From 
that point of view it is not particularly surprising that some 
common algorithms may be used in both vision and audi- 
tion, particularly as the cochlea generates a (highly redun- 
dant) time-frequency representation of sound. 

2. ALGORITHM DESCRIPTION

The algorithm consists of three steps, as shown in Fig. 1.
After transforming the signal into a time-frequency
representation, feature extraction is performed by matching the
time-frequency plane with a number of previously learned
time-frequency blocks. The minimum matching energies of the
blocks form a feature vector of the audio signal, which is sent
to a classifier.



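[Figure 1: block diagram. The log-spectrogram S[l,k] (time axis l)
is matched against each block B_1, ..., B_M with energy
E = ||S - B_m||^2, giving the maps E[l,k,1], ..., E[l,k,M] and the
features C[1], ..., C[M], which are passed together with the
training data to the classifier.]

Fig. 1. Algorithm overview.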



2.1. Time-Frequency Representation 

Let us denote an audio signal $f[n]$, $n = 0, 1, \ldots, N-1$.
A time-frequency transform decomposes $f$ over a family of
time-frequency atoms $\{g_{l,k}\}_{l,k}$, where $l$ and $k$ are
the time and frequency (or scale) localization indices. The
resulting coefficients shall be written:

$$F[l,k] = \langle f, g_{l,k} \rangle = \sum_{n=0}^{N-1} f[n]\, g_{l,k}^*[n], \qquad (1)$$

where $*$ denotes the conjugate. The short-time Fourier transform
is most commonly used in audio processing and recognition
[11] Is). Short-time Fourier atoms can be written
$g_{l,k}[n] = w[n - lu] \exp\left(\frac{i 2\pi k n}{K}\right)$,
where $w[n]$ is a Hanning window of support size $K$, which is
shifted with a step $u < K$. $l$ and $k$ are respectively the
integer time and frequency indices, with $0 \le l < N/u$ and
$0 \le k < K$.

The time-frequency representation provides a good do- 
main for audio classification for several reasons. First, of 
course, as the time-frequency transform is invertible, the time- 
frequency representation contains complete information of 
the audio signal. More importantly, the texture-like time- 
frequency representations usually contain distinctive patterns 
that capture different characteristics of the audio signals. Let
us take the spectrograms of sounds of musical instruments
illustrated in Fig. 2 as an example. Trumpet sounds often contain
clear onsets and stable harmonics, resulting in clean vertical
and horizontal structures in the time-frequency plane. Piano
recordings are also rich in clear onsets and stable harmonics,
but they contain more chords and the tones tend to transition
fluidly, making the vertical and horizontal time-frequency
structures denser. Flute pieces are usually soft and smooth.
Their time-frequency representations contain hardly any ver- 
tical structures, and the horizontal structures include rapid vi- 
brations. Such textural properties can be easily learned with- 
out explicit detailed analysis of the corresponding patterns. 

As human perception of sound intensity is logarithmic [20],
the classification is based on the log-spectrogram

$$S[l,k] = \log |F[l,k]|. \qquad (2)$$
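The two definitions above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names, the small offset `eps` that avoids taking the logarithm of zero, and the choice to keep only the non-redundant frequency bins of a real signal are all assumptions of the sketch. It uses a direct (slow) Fourier sum so that it needs only the standard library.

```python
import cmath
import math

def hann(K):
    """Hann (Hanning) window of support size K."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / K) for n in range(K)]

def log_spectrogram(f, K, u, eps=1e-10):
    """Return S[l][k] = log|F[l,k]| for windows shifted by u samples.

    Only windows that fit entirely inside the signal are used, and only
    the K // 2 + 1 non-redundant frequency bins of a real signal are kept.
    eps is an assumption of this sketch; it avoids log(0).
    """
    w = hann(K)
    S = []
    for start in range(0, len(f) - K + 1, u):
        frame = [f[start + n] * w[n] for n in range(K)]
        row = []
        for k in range(K // 2 + 1):
            # Inner product with the conjugate atom, up to a phase factor
            # that does not change the modulus.
            coeff = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / K)
                        for n in range(K))
            row.append(math.log(abs(coeff) + eps))
        S.append(row)
    return S

# Usage: a pure tone centred on frequency bin 8 of a 64-sample window.
K, u = 64, 32
tone = [math.sin(2 * math.pi * 8 * n / K) for n in range(256)]
S = log_spectrogram(tone, K, u)
peak_bins = [max(range(len(row)), key=row.__getitem__) for row in S]
```

In practice a fast Fourier transform would replace the inner sum; the direct form is kept here only because it mirrors equation (1) term by term.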
Fig. 2. Log-spectrograms of solo phrases of different musical
instruments. 

2.2. Feature Extraction 

Assume that one has learned $M$ time-frequency blocks $B_m$ of
size $W_m \times L_m$, each block containing some time-frequency
structures of audio signals of various types. To characterize an
audio signal, the algorithm first matches its log-spectrogram
$S$ with the sliding blocks $B_m$, $\forall m = 1, \ldots, M$,

$$E[l,k,m] = \| S_{l,k} - B_m \|^2, \qquad (3)$$

where, following Fig. 1, $S_{l,k}$ denotes the $W_m \times L_m$
patch of $S$ located at position $[l,k]$.

$E[l,k,m]$ measures the degree of resemblance between the
patch $B_m$ and the log-spectrogram $S$ at position $[l,k]$. A
minimum operation is then performed on the map $E[l,k,m]$ to
extract the highest degree of resemblance locally between $S$
and $B_m$:

$$C[m] = \min_{l,k} E[l,k,m]. \qquad (4)$$
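The matching and minimum steps can be sketched as follows. This is a brute-force illustration under stated assumptions: the matching energy is taken to be the squared Euclidean distance between the block and the patch, as indicated in Fig. 1; the first list index is treated as the block's first dimension; only positions where the block fits entirely inside the spectrogram are visited; and the function names are hypothetical.

```python
def matching_energy(S, B):
    """E[l][k] = sum over (i, j) of (S[l+i][k+j] - B[i][j])**2.

    Assumes the block B is no larger than the spectrogram S.
    """
    W, L = len(B), len(B[0])
    rows, cols = len(S), len(S[0])
    E = []
    for l in range(rows - W + 1):
        E_row = []
        for k in range(cols - L + 1):
            e = 0.0
            for i in range(W):
                for j in range(L):
                    d = S[l + i][k + j] - B[i][j]
                    e += d * d
            E_row.append(e)
        E.append(E_row)
    return E

def features(S, blocks):
    """C[m] = minimum of E[l][k] over all positions, one value per block."""
    return [min(min(row) for row in matching_energy(S, B)) for B in blocks]

# Usage: a toy 4 x 5 "spectrogram" and two 2 x 2 blocks.
S = [[0, 1, 2, 3, 4],
     [5, 6, 7, 8, 9],
     [1, 0, 1, 0, 1],
     [2, 2, 2, 2, 2]]
B1 = [[7, 8], [1, 0]]   # copied from S, so a perfect match exists
B2 = [[0, 0], [0, 0]]   # never matches exactly
C = features(S, [B1, B2])
```

Because the minimum is taken over every position, moving a pattern elsewhere in the time-frequency plane leaves its coefficient unchanged.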

The coefficients $C[m]$, $m = 1, \ldots, M$, are time-frequency
translation invariant. They constitute a feature vector
$\{C[m]\}$ of size $M$ of the audio signal. Let us note that the
block matching operation (3) can be implemented fast by
convolution.
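One way to see why convolution applies is to expand the square. The derivation below is a sketch that assumes the matching energy is the squared Euclidean distance between block and patch, as indicated in Fig. 1; the zero-based offsets $i$, $j$ are a notational choice made here, not taken from the paper.

```latex
\begin{aligned}
E[l,k,m] &= \sum_{i=0}^{W_m-1} \sum_{j=0}^{L_m-1}
            \bigl( S[l+i,k+j] - B_m[i,j] \bigr)^2 \\
         &= \sum_{i,j} S[l+i,k+j]^2
            \;-\; 2 \sum_{i,j} S[l+i,k+j]\, B_m[i,j]
            \;+\; \sum_{i,j} B_m[i,j]^2 .
\end{aligned}
```

The first term is the convolution of $S^2$ with an all-ones block of size $W_m \times L_m$, the second is the cross-correlation of $S$ with $B_m$ (a convolution with the flipped block), and the third is a constant for each block. All positions $[l,k]$ can therefore be evaluated with two convolutions, which may be computed with a fast Fourier transform.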

The feature coefficient $C[m]$ is expected to be discriminative
if the time-frequency block $B_m$ contains some salient
time-frequency structures. In this paper, we apply a simple
random sampling strategy to learn the blocks as in [14] [19]:
each block is extracted at a random position from the
log-spectrogram $S$ of a randomly selected training audio sample.
Blocks of various sizes are applied to capture time-frequency
structures at different orientations and scales [17]. Since
audio time-frequency representations are rather stationary
images and often contain repetitive patterns, the random
sampling learning is particularly efficient. Patterns that appear
with high probability are likely to be learned.
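The random sampling strategy can be sketched as below. The function name, the fixed seed and the equal number of blocks per size are choices made for this illustration, and the sketch assumes every training log-spectrogram is at least as large as the largest block.

```python
import random

def sample_blocks(spectrograms, sizes, per_size, seed=0):
    """Draw per_size blocks of each (W, L) size at random positions.

    spectrograms: list of 2-D lists (training log-spectrograms).
    sizes: list of (W, L) block sizes.
    """
    rng = random.Random(seed)
    blocks = []
    for W, L in sizes:
        for _ in range(per_size):
            S = rng.choice(spectrograms)           # random training sample
            l = rng.randrange(len(S) - W + 1)      # random position
            k = rng.randrange(len(S[0]) - L + 1)
            blocks.append([row[k:k + L] for row in S[l:l + W]])
    return blocks

# Usage: two toy 6 x 6 spectrograms and three block sizes.
spec_a = [[float(6 * r + c) for c in range(6)] for r in range(6)]
spec_b = [[float(100 + r * c) for c in range(6)] for r in range(6)]
sizes = [(4, 4), (2, 4), (4, 2)]
blocks = sample_blocks([spec_a, spec_b], sizes, per_size=3)
```

Patterns that cover a large share of the training spectrograms are drawn more often, which is why frequent structures tend to end up among the learned blocks.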

2.3. Classification 

The classification uses the minimum block matching energy
coefficients $C$ as features. While various classifiers such as
SVMs can be used, a simple and robust nearest neighbor
classifier will be applied in the experiments.
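A nearest neighbor classifier over such feature vectors can be sketched as follows. The paper does not state which distance is used, so the squared Euclidean distance here is an assumption, as are the function name and the toy labels.

```python
def nearest_neighbor(train_features, train_labels, x):
    """Return the label of the training vector closest to x.

    Distance is squared Euclidean (an assumption of this sketch).
    """
    best_label, best_dist = None, float("inf")
    for c, label in zip(train_features, train_labels):
        d = sum((a - b) ** 2 for a, b in zip(c, x))
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label

# Usage with two toy training vectors.
train = [[0.0, 0.0], [10.0, 10.0]]
labels = ["flute", "drum"]
pred_a = nearest_neighbor(train, labels, [1.0, 2.0])
pred_b = nearest_neighbor(train, labels, [9.0, 8.0])
```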

3. EXPERIMENTS AND RESULTS 

The audio classification scheme is evaluated through musi- 
cal instrument recognition. Solo phrases of eight instruments 
from different families, namely flute, trumpet, tuba, violin, 
cello, harpsichord, piano and drum, were considered. Mul- 
tiple instruments from the same family, violin and cello for 
example, were used to avoid over-simplification of the prob- 
lem. 

To prepare the experiments, great effort has been dedicated to
collecting data from diverse sources with enough variation, as
few databases are publicly available. Sound samples were mainly
excerpted from classical music CD recordings of personal
collections. A few were collected from the Internet. For each
instrument, at least 822 seconds of sound were assembled from
more than 11 recordings, as summarized in Table 1. All
recordings were segmented into non-overlapping excerpts of
5 seconds. For each instrument, 50 excerpts (250 seconds) were
randomly selected for each of the training and test data sets.
The training and test sets never contained the same excerpts.
In order to avoid bias, excerpts from the same recording were
never included in both the training set and the test set.

Table 1. Sound database. Rec and Time are the number of
recordings and the total time (seconds). Musical instruments
from left to right: violin, cello, piano, harpsichord, trumpet,
tuba, flute and drum.
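The paper does not describe how the recording-disjoint split was carried out. One possible way to build it for a single instrument is sketched below; the function name, the half-and-half assignment of recordings and the fixed seed are all assumptions of the sketch.

```python
import random

def split_by_recording(excerpts, n_per_set, seed=0):
    """Split (recording_id, excerpt) pairs into train and test samples
    such that no recording contributes to both sets."""
    rng = random.Random(seed)
    recordings = sorted({rec for rec, _ in excerpts})
    rng.shuffle(recordings)
    train_recs = set(recordings[:len(recordings) // 2])
    train_pool = [e for e in excerpts if e[0] in train_recs]
    test_pool = [e for e in excerpts if e[0] not in train_recs]
    if min(len(train_pool), len(test_pool)) < n_per_set:
        raise ValueError("not enough excerpts on one side of the split")
    return rng.sample(train_pool, n_per_set), rng.sample(test_pool, n_per_set)

# Usage: 12 hypothetical recordings with 15 five-second excerpts each.
excerpts = [(rec, (rec, idx)) for rec in range(12) for idx in range(15)]
train_set, test_set = split_by_recording(excerpts, n_per_set=50)
```

Splitting at the level of recordings rather than excerpts prevents the classifier from being rewarded for recognizing a particular recording's acoustics instead of the instrument.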

Human sound recognition performance does not seem to degrade
when signals are sampled at 11000 Hz. Therefore signals were
down-sampled to 11025 Hz to limit the computational load.
Half-overlapping Hanning windows of length 50 ms were applied
in the short-time Fourier transform. Time-frequency blocks of
seven sizes (16 x 16, 16 x 8, 8 x 16, 8 x 8, 8 x 4, 4 x 8 and
4 x 4), covering time-frequency areas from 640 Hz x 800 ms down
to 160 Hz x 200 ms, were used simultaneously, with the same
number of blocks for each size, to capture time-frequency
structures at different orientations and scales. The classifier
was a simple nearest neighbor classification algorithm.

Fig. 3 plots the average accuracy achieved by the algorithm
as a function of the number of features (which is seven
times the number of blocks per block size). The performance
rises rapidly to a reasonably good accuracy of 80% when
the number of features increases to about 140. The accuracy
continues to improve slowly thereafter and stabilizes at a very
satisfactory level of about 85% once the number of features
exceeds 350. Although this number of visual features is much
larger than the number of carefully designed classical acoustic
features (about 20) commonly used in the literature [6, 5],
their computation is uniform and very fast.

The confusion matrix in Table 2 reveals the classification
details (with 420 features) for each instrument. The highest
confusion occurred between the harpsichord and the piano,
which can produce very similar sounds. Other pairs of
instruments that may produce sounds of a similar nature, such
as flute and violin, were occasionally confused. Some trumpet
excerpts were confused with violin and flute; these excerpts
were found to be rather soft and contained mostly harmonics. The
drum, which is the most distinct from the others, had the lowest
confusion rate. The average accuracy is 85.5%.

Fig. 3. Average accuracy versus number of features.

        Vio.  Cel.  Pia.  Hps.  Tru.  Tuba  Flu.  Drum
Vio.      94     0     0     0     2     0     4     0
Cel.       0    84     6    10     0     0     0     0
Pia.       0     0    86     8     6     0     0     0
Hps.       0     0    26    74     0     0     0     0
Tru.       8     2     2     0    80     0     8     0
Tuba       2     4     2     0     0    90     0     2
Flu.       6     0     0     0     0     0    94     0
Drum       0     0     0     0     0     2     0    98

Table 2. Confusion matrix. Each entry is the rate (in percent) at
which the row instrument is classified as the column instrument.
Musical instruments from top to bottom, left to right: violin,
cello, piano, harpsichord, trumpet, tuba, flute and drum.

4. CONCLUSION AND FUTURE WORK

An audio classification algorithm is proposed, with spectrograms
of sounds treated as texture images. The algorithm is inspired
by an earlier biologically-motivated visual classification
scheme that is particularly efficient at classifying textures.
In experiments, this simple algorithm relying purely on
time-frequency texture features achieves surprisingly good
performance at musical instrument classification.

In future work, such image features could be combined 
with more classical acoustic features. In particular, the still 
largely unsolved problem of instrument separation in poly- 
phonic music may be simplified using this new tool. 